[zephyr/tokenize] Use bulk list-objects for file sizes, delete filescan job#4658
Conversation
…-file stats

Replace per-file fsspec stat RPCs with fsspec `glob(detail=True)`, which returns file sizes from the same list-objects API call that discovers files — zero extra RPCs needed.

Zephyr changes:
- `Dataset.from_files()` now stores a lazy `GlobSource` instead of eagerly globbing. Resolution happens at plan time in `compute_plan()`.
- `resolve_glob()` uses `glob(detail=True)` to produce `FileEntry` objects (path + size) from the bulk listing response.
- `InputFileSpec` gains a `size` field, populated from the bulk listing.

Tokenize changes:
- Delete the tokenize-filescan Zephyr job (introduced in #4341) that launched 32 distributed workers just to stat files one at a time.
- Replace it with `_glob_with_sizes()`, which gets sizes for free from the same glob call. Works for `gs://`, `hf://`, `s3://`, and local paths.
- Remove the dead `InputName`/`ExecutorStep` early-return branch from `_get_filepaths_to_tokenize`.

Benchmark on nemotron hq_actual (2,755 files, 1 TB):
- `glob(detail=True)`: ~2s (same cost as a plain glob)
- Previous approach: 32-worker Zephyr job doing individual stats

Part of #4411, part of #4587
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98604d2e02
@claude strip provenance.json from the validation/test sets as well, as per PR feedback
Claude finished @rjpower's task in 2m 43s. Done — provenance.json filter applied to validation files.
Change:

```python
train_file_stats = _glob_with_sizes(train_patterns)
train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
validation_file_stats = _glob_with_sizes(validation_patterns)
validation_file_stats = [f for f in validation_file_stats if "provenance.json" not in f["filename"]]  # new
```

One-line mirror of the existing train filter.
The bulk glob flow previously stripped provenance.json only from train files, so directory-based validation inputs could include metadata JSON in the tokenization stream. Apply the same filter to validation files. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
```python
            output_paths.extend(fsspec_glob(os.path.join(path, f"**/*.{ex}")))
        else:
            output_paths.extend(fsspec_glob(path))
```

```python
_TOKENIZE_EXTENSIONS = ["json.{gz,zst,zstd}", "jsonl.{gz,zst,zstd}", "parquet", "json"]
```
is plain .json safe here?
we can probably remove it; probably less safe than useful, I guess. Too many chances for weird things.
```python
if not train_paths and not validation_paths:
    # Resolve patterns → concrete files with sizes (single list-objects call per pattern)
    train_file_stats = _glob_with_sizes(train_patterns)
    train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
```
the "provenance.json" special-case seems brittle
yeah i'll figure out a more appropriate setup
the old code has this as well. I don't see a great way around it unfortunately.
we should really be writing our own metadata into isolated directories like dir/.marin/provenance.json, but of course we have a lot of leftover datasets with this in it.
I changed it to use a constant set of known metadata files & filter those out (and we don't look for plain .json anymore which should avoid most of the issues...)
```python
    path: str
    format: Literal["parquet", "jsonl", "vortex", "auto"] = "auto"
    size: int | None = None
```
docstring not updated. `size` here feels a bit off: `InputFileSpec` implies the "specification to read the data", so isn't `size` metadata rather than part of the specification?
moved off of the filespec
….json auto, named sidecar filter

- `InputFileSpec` is now a pure read-spec (`size` field removed)
- `FileEntry` holds spec + size; readers see `InputFileSpec` only, planners still size shards via `FileEntry.size`
- `_TOKENIZE_EXTENSIONS` no longer auto-globs plain `.json` (avoids matching sidecars)
- `_MARIN_SIDECAR_NAMES` + `_drop_sidecars` replaces the brittle substring filter; applied uniformly to train and validation
- Raise when a configured split (train or validation) resolves to zero files (codex P1)
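The spec/metadata split described in this commit could look roughly like the sketch below. The `spec`/`size` field names follow the commit message; the `plan_shards` helper and its greedy policy are illustrative assumptions, not the PR's actual planner:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class InputFileSpec:
    """Pure read-spec: everything here is caller-supplied."""
    path: str
    format: Literal["parquet", "jsonl", "vortex", "auto"] = "auto"


@dataclass(frozen=True)
class FileEntry:
    """A discovered file: the read-spec plus metadata from the bulk listing."""
    spec: InputFileSpec
    size: int


def plan_shards(entries: list[FileEntry], max_shard_bytes: int) -> list[list[InputFileSpec]]:
    """Greedy sizing of shards by FileEntry.size; readers only ever
    receive the InputFileSpec, never the discovered metadata."""
    shards: list[list[InputFileSpec]] = []
    current: list[InputFileSpec] = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.size, reverse=True):
        if current and used + entry.size > max_shard_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(entry.spec)
        used += entry.size
    if current:
        shards.append(current)
    return shards
```

Keeping `size` on `FileEntry` rather than `InputFileSpec` answers the review comment above: discovered metadata stays out of the caller-supplied read spec.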
```python
def _glob_with_sizes(patterns: list[str]) -> list[dict]:
    """Glob patterns and return [{"filename": path, "size": bytes}].
```

```python
# NOTE(chris): Marin's `default_download` writes a `provenance.json` sidecar next to
```

```python
class InputFileSpec:
    """Specification for reading a file or portion of a file.

    Pure read-spec: everything here is caller-supplied. Discovered metadata
```
I dislike these kinds of side-effect comments from Claude :/

What

Replace per-file `fsspec_size()` stat RPCs with `fsspec.glob(detail=True)`, which returns file sizes from the same list-objects API call that discovers files. Zero extra RPCs.

This eliminates the `tokenize-filescan` Zephyr job (#4341) that launched 32 distributed workers just to stat files one at a time. On nemotron hq_actual (2,755 files, 1 TB), `glob(detail=True)` takes ~2s — the same cost as a plain glob.

Zephyr changes

- `Dataset.from_files()` now stores a lazy `GlobSource` op instead of eagerly globbing at construction time
- `compute_plan()` resolves `GlobSource` → `FileEntry` objects (path + size) via `resolve_glob()`
- `InputFileSpec` gains a `size` field populated from the bulk listing
- `_compute_file_pushdown` now takes `list[FileEntry]` instead of `list[str]`

Tokenize changes

- Delete the `tokenize-filescan` Zephyr job and all its machinery (`fsspec_size`, batched stat workers)
- Add `_glob_with_sizes()` — takes a list of patterns, returns `[{"filename", "size"}]` using `detail=True`
- Add `_expand_tokenize_paths()` — directory → recursive extension globs
- `TokenizeConfig` and `HfTokenizeConfig` paths now go through the same glob → bundle flow
- Remove the dead `InputName`/`ExecutorStep` early-return branch

Benchmark

`detail=True` works identically for `gs://`, `hf://`, `s3://`, and local filesystems.

Part of #4411, part of #4587
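The lazy-glob flow described above (patterns recorded at construction, resolved only when a plan is computed) can be sketched like this. These are illustrative stand-ins for Zephyr's actual `Dataset`/`GlobSource` classes, assuming only fsspec's `url_to_fs` and `glob(detail=True)`:

```python
from dataclasses import dataclass

import fsspec


@dataclass
class GlobSource:
    """Unresolved glob patterns; no I/O until resolve() is called."""
    patterns: list[str]

    def resolve(self) -> list[tuple[str, int]]:
        """One list-objects call per pattern yields paths *and* sizes."""
        entries = []
        for pattern in self.patterns:
            fs, path = fsspec.core.url_to_fs(pattern)
            for name, info in fs.glob(path, detail=True).items():
                if info.get("type") == "file":
                    entries.append((name, info["size"]))
        return sorted(entries)


class Dataset:
    def __init__(self, source: GlobSource):
        self._source = source  # no globbing at construction time

    @classmethod
    def from_files(cls, *patterns: str) -> "Dataset":
        return cls(GlobSource(list(patterns)))

    def compute_plan(self) -> list[tuple[str, int]]:
        return self._source.resolve()  # globbing happens here, at plan time
```

Deferring resolution to `compute_plan()` means a dataset definition stays cheap to construct, and the listing reflects the files that exist when the plan actually runs.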